Add federation proposal for cross-server agent communication by khaliqgant · Pull Request #8 · AgentWorkforce/relay

khaliqgant · 2025-12-21T09:18:57Z

Comprehensive design document for extending agent-relay to support
federated multi-server deployments while preserving the core
differentiator: automatic message injection via tmux.

Key design decisions:

Separation of concerns: routing (network) vs injection (local)
Hybrid topology: optional hub for discovery, direct peer connections
Progressive enhancement: single-server unchanged, federation opt-in
WebSocket + TLS for peer-to-peer daemon communication
Message queuing for resilience during disconnects

Includes:

Full protocol specification (PEER_HELLO, PEER_ROUTE, etc.)
Agent discovery and registry design
Security model (TLS, pre-shared tokens)
Configuration schema
CLI interface design
5-phase implementation plan (~4-5 weeks)

bd-TBD


Identifies major gaps and risks in the federation design:

HIGH SEVERITY:
- No end-to-end delivery guarantee (sender doesn't know if agent received)
- Registry consistency race conditions (split-brain on name collisions)
- Message ordering not guaranteed across servers

MEDIUM SEVERITY:
- Token management doesn't scale (N² tokens for N servers)
- No message-level authentication (spoofing possible)
- No rate limiting (flood attacks possible)
- Debugging distributed failures is hard (no tracing)
- NAT/firewall traversal not addressed
- Timeline underestimated (8-10 weeks realistic vs 4-5 proposed)

Includes:
- Specific failure scenarios for each issue
- Recommendations for fixes
- Alternative approaches (NATS, SSH tunnels)
- Suggested MVP scope to ship faster

bd-TBD

Major additions to address identified issues:

DELIVERY CONFIRMATION (Section 7):
- End-to-end ACK so sender knows message was injected
- Detection via capture-pane after send-keys
- Optional confirmation notification to sender

REGISTRY CONSISTENCY (Section 5.3):
- Fleet-wide unique names (no split-brain)
- Quorum-based registration with Lamport timestamps
- Clear error on name collision with suggestions

AUTHENTICATION (Section 8.1-8.3):
- Ed25519 asymmetric keys (scales better than N² tokens)
- Challenge-response handshake
- Per-message signing to prevent spoofing
- TOFU, static config, or CA options

FLOW CONTROL (Section 9):
- Credit-based flow control
- PEER_BUSY/PEER_READY backpressure signals
- Token bucket rate limiting (per-peer, per-agent, fleet-wide)
- Bounded queues with drop policies

TRANSPORT ABSTRACTION (Section 11):
- Pluggable PeerTransport interface
- WebSocket implementation (default)
- NATS JetStream implementation (optional)
- Migration path from WebSocket to NATS

TIMELINE (Section 14.3-14.4):
- Realistic estimate: 8-10 weeks (not 4-5)
- MVP option for 4-week delivery
- Phase 6: Stabilization added

OPEN QUESTIONS (Section 16):
- 10 unresolved questions for discussion
- Recommendations for each
- Clear decision points before implementation

bd-TBD
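
For readers unfamiliar with the token bucket rate limiting named under FLOW CONTROL, here is a minimal sketch. The class name, constructor parameters, and the injectable clock are assumptions for illustration; the proposal's actual Section 9 design is not reproduced on this page.

```typescript
// Hypothetical token bucket; names and parameters are illustrative only.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly capacity: number;      // maximum burst size
  private readonly refillPerSec: number;  // sustained messages per second

  constructor(capacity: number, refillPerSec: number, nowMs: number = Date.now()) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.tokens = capacity;
    this.lastRefillMs = nowMs;
  }

  /** Returns true if one message may be sent now, consuming one token. */
  tryConsume(nowMs: number = Date.now()): boolean {
    // Refill in proportion to elapsed time, capped at the bucket capacity.
    const elapsedSec = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefillMs = nowMs;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

One bucket per peer, per agent, and for the fleet as a whole would give the three scopes listed above; a message would then be admitted only if every applicable bucket returns true.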

Addresses storage requirements for federated deployments by separating:
- Ephemeral storage (routing): Memory or NATS JetStream for message queues
- Durable storage (trajectories): File/SQLite local + PostgreSQL/S3 central

References the trajectories proposal (PR #3) for detailed format specification. Includes configuration examples and federation impact analysis.

Tasks organized by phase (1-5) and assigned to agent roles:
- Architect: Protocol design, testing, docs (3 tasks)
- Network: PeerConnection, PeerManager, reconnection, flow control (8 tasks)
- Router: FleetRegistry, routing, broadcast, delivery confirmation (8 tasks)
- Security: TLS, Ed25519 authentication (2 tasks)
- Storage: Message queues, trajectory storage (3 tasks)
- CLI: Fleet commands, config, dashboard (4 tasks)

Dependencies mapped to ensure correct build order. See docs/FEDERATION_PROPOSAL.md for full specification.

Changes:
- Add collaborators for cross-boundary tasks (8 tasks now dual-assigned)
- Fix fed-014 dependency (queue can start after fed-004, not fed-013)
- Add fed-026a for PeerTransport interface before NATS adapter
- Add Architect review to Security tasks (fed-019)
- Lower priority on fed-012 (loop prevention can merge with routing)

Dual assignments:
- fed-001: Architect + Network (protocol types)
- fed-011: Router + Network (federated router integration)
- fed-014: Storage + Network (message queue)
- fed-018: Security + Network (TLS)
- fed-019: Security + Architect (Ed25519 crypto review)
- fed-022: Router + Network (delivery confirmation)
- fed-023: Network + Storage (flow control)
- fed-026a: Architect + Network (transport interface)
- fed-027: Architect + Network + Router (integration tests)

Control Plane Tasks (12):
- ctrl-001: Design Control API (REST + WebSocket)
- ctrl-002: Lead Agent orchestration
- ctrl-003: Web dashboard v2 (fleet control)
- ctrl-004: Human authentication (OAuth/magic link)
- ctrl-005: Push notification service (APNs/FCM)
- ctrl-006: iPhone app MVP
- ctrl-007: Slack/Discord bot integration
- ctrl-008: Human escalation queue
- ctrl-009: Agent skills registry
- ctrl-010: Code Graph integration (from ai-maestro)
- ctrl-011: Agent health monitoring (from ai-maestro)
- ctrl-012: Agent portability export/import (from ai-maestro)

Competitive Analysis:
- ai-maestro uses file-based messaging (human relay required)
- agent-relay uses auto-injection (truly autonomous)
- Learn from ai-maestro: Code Graphs, health monitoring, portability
- Our advantage: real-time messaging backbone

- agent-relay owns ephemeral storage (routing queues, ACKs, flow control)
- agent-trajectories owns durable storage (trajectories, knowledge workspace)
- Add event emission interface for agent-relay → agent-trajectories integration
- Remove duplicate trajectory storage details (now in agent-trajectories repo)
- Update summary to reflect separation of concerns

Decision: Use Mem0 (github.com/mem0ai/mem0) as memory layer for agent-trajectories rather than building from scratch.

Why Mem0:
- 25k+ stars, YC-backed, active development
- Multi-LLM support (not just OpenAI)
- MCP integration exists for Claude Code
- Self-hosted option (Apache 2.0)
- +26% accuracy vs OpenAI Memory benchmarks

What we build on top:
- Task-based trajectory grouping
- Inter-agent event capture
- Fleet-wide knowledge workspace
- .trajectory export format

New tasks (mem-001 through mem-005):
- Integrate Mem0 SDK
- Configure MCP for Claude Code
- Build trajectory layer
- Implement knowledge workspace
- Abstract MemoryBackend interface

See docs/MEMORY_STACK_DECISION.md for full rationale.

- Add section explaining MCP-based approach where Claude Code IS the LLM
- Update integration examples to use infer:false (no API needed)
- Add direct Qdrant alternative for simpler implementation
- Document embedding options without paid APIs (Ollama, FastEmbed)
- Update next steps to reflect MCP-first approach

Key insight: With MCP, the agent handles intelligence, Mem0 becomes pure storage + vector search. No Anthropic SDK required.

@memory

- Define pattern namespace system (@relay:, @memory:, @Custom:)
- Add hook lifecycle events (onSessionStart, onOutput, etc.)
- Document HookContext and programmatic API
- Add relay.config.ts configuration format
- Create 6 hook-* tasks for implementation roadmap

Hooks enable:
- Automatic memory prompts at session end
- User-defined pattern handlers
- Integration points for extensions

See docs/HOOKS_API.md for full design.

- Add detailed spec for each lifecycle event:
  - onSessionStart: when, trigger point, code example, use cases
  - onOutput: polling mechanism, handler signature, performance notes
  - onIdle: threshold config, once-per-period firing
  - onMessageReceived: suppress/modify capability
  - onSessionEnd: SIGINT handling, wait for response
- Add HookEmitter class design
- Add Event Summary Table
- Create 7 new granular tasks (hook-007 to hook-013)

Tasks cover: HookEmitter, each lifecycle event, and types.

Hook Context (read-only):
- agentId, agentName, sessionId, workingDir, projectName
- recentOutput (last 50 chunks), recentMessages (last 20)
- Timing: sessionStartTime, lastOutputTime, idleSeconds

Hook Result (allowed actions):
- inject: max 2000 chars, sanitized
- suppress: for onMessageReceived only
- stop: prevent other handlers
- sendMessage: one per invocation, max 5000 chars
- log: audit log entry

Prohibited:
- File system access
- Shell execution
- Network requests
- Env modification
- Full output access

Capability escalation via explicit config grants. Added hook-014 task for sandboxing implementation.
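
A rough sketch of how the Hook Result limits above might be enforced before a result is acted on. The field names and the 2000/5000 character caps come from the commit message; the `sendMessage` shape, the function name, and the choice to sanitize by stripping control characters are assumptions, since the commit does not say what "sanitized" means.

```typescript
// Hypothetical shapes: only the field names and limits are from the commit message.
interface HookResult {
  inject?: string;                              // text injected into the agent session
  suppress?: boolean;                           // honoured for onMessageReceived only
  stop?: boolean;                               // prevent other handlers from running
  sendMessage?: { to: string; body: string };   // assumed shape, one per invocation
  log?: string;                                 // audit log entry
}

const MAX_INJECT_CHARS = 2000;
const MAX_MESSAGE_CHARS = 5000;

function enforceLimits(event: string, result: HookResult): HookResult {
  const out: HookResult = { ...result };
  if (out.inject !== undefined) {
    // Assumed sanitization: drop control characters except tab and newline, then truncate.
    out.inject = out.inject
      .replace(/[\x00-\x08\x0b-\x1f\x7f]/g, "")
      .slice(0, MAX_INJECT_CHARS);
  }
  if (out.sendMessage !== undefined) {
    out.sendMessage = {
      to: out.sendMessage.to,
      body: out.sendMessage.body.slice(0, MAX_MESSAGE_CHARS),
    };
  }
  // suppress is only meaningful for incoming messages; ignore it elsewhere.
  if (event !== "onMessageReceived") {
    delete out.suppress;
  }
  return out;
}
```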

@ticket

Examples cover:
1. Memory integration - load context, prompt to save
2. Error detection - alert coordinator on failures
3. Message filtering - suppress/highlight by priority
4. Custom pattern - @ticket: handler
5. Coordinator hooks - special behavior for lead agent
6. Minimal config - just session end prompt
7. Debug mode - log all events

Each example demonstrates:
- Using HookContext (read-only)
- Returning HookResult (inject, sendMessage, log, suppress)
- Staying within sandbox limits

onSessionStart (2 examples):
- Inject project context
- Role-based context by agent name

onOutput (3 examples):
- Error detection and alerting
- Progress tracking (test results)
- Security keyword alerting

onIdle (3 examples):
- Escalating idle prompts (30s gentle, 2min urgent)
- Auto-save reminder
- Silent coordinator notification

onMessageReceived (4 examples):
- Custom formatting with priority
- Suppress broadcasts while focused
- Filter by sender whitelist
- Transform task assignments

onSessionEnd (4 examples):
- Memory save prompt
- Notify team of departure with duration
- Request summary before exit
- Silent logging only
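
The examples themselves live in the hooks document rather than on this page. As an illustration of the escalating idle prompt idea only, a handler might look like the sketch below. The 30 second and 2 minute thresholds are from the commit message; the type names, function name, and prompt wording are invented here.

```typescript
// Hypothetical minimal types; the real HookContext carries more fields.
interface IdleContext {
  agentName: string;
  idleSeconds: number;
}

interface IdleResult {
  inject?: string;
}

function escalatingIdlePrompt(ctx: IdleContext): IdleResult {
  if (ctx.idleSeconds >= 120) {
    // Urgent tier after 2 minutes of inactivity.
    return { inject: `${ctx.agentName}: idle for ${ctx.idleSeconds}s. Report status or blockers now.` };
  }
  if (ctx.idleSeconds >= 30) {
    // Gentle tier after 30 seconds.
    return { inject: `${ctx.agentName}: still working? A short status update would help.` };
  }
  // Below the first threshold: do nothing.
  return {};
}
```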

Add -d/--detach flag to start agents in background, allowing SSH users to disconnect without losing agent sessions. Includes attach/kill commands for session management.

…ts test The test was checking if dataDir exists, but listProjects() requires the .project marker file to be present.

…iqgant/agent-relay into claude/continue-pr-8-7tWzb

…continue-pr-8-7tWzb

feat: Add detached mode for long-running agent sessions

Port the hooks API design document from PR #8 with additional trajectory integration examples showing how hooks can work with the PDERO paradigm and trail CLI.

This document supersedes the original federation proposal with a realistic assessment of what's built today and a detailed roadmap for achieving the N-servers-per-org vision.

Key sections:
- Current state analysis with file references
- Gap analysis comparing PR #8 proposal vs reality
- Target architecture with org-centric model
- 5-phase implementation roadmap (9 weeks total)
- Per-user team pricing model
- Technical specs for P2P protocol and agent registry

Related: #8

Added Appendix B with detailed solutions for distributed systems challenges identified in PR #8's review:

Critical (🔴):
- End-to-end delivery confirmation via capture-pane verification
- Registry consistency using cloud as authoritative source
- Message deduplication with TTL-based seen set

High Priority (🟡):
- Backpressure with PEER_BUSY/PEER_READY and bounded queues
- Distributed tracing with correlation IDs

Medium Priority:
- NAT/firewall traversal with hybrid topology
- Clock skew handling via relative TTLs

Also preserved PR #8's detailed protocol specification (PEER_HELLO, PEER_ROUTE, etc.) and hybrid topology recommendation.

The document now serves as the authoritative architecture reference, superseding PR #8 while incorporating its valuable insights.
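
To make the "TTL-based seen set" item concrete, here is a minimal sketch of message deduplication. The class and method names are assumptions, and the eager eviction on every call is a simplification; Appendix B itself is not reproduced on this page.

```typescript
// Hypothetical dedup helper: remembers message IDs for a fixed time window.
class SeenSet {
  private readonly expiry = new Map<string, number>(); // message ID -> expiry time (ms)
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  /** Returns true the first time an ID is seen within the TTL window, false for duplicates. */
  firstSeen(messageId: string, nowMs: number = Date.now()): boolean {
    this.evictExpired(nowMs);
    if (this.expiry.has(messageId)) {
      return false;
    }
    this.expiry.set(messageId, nowMs + this.ttlMs);
    return true;
  }

  private evictExpired(nowMs: number): void {
    // Deleting entries while iterating a Map with forEach is safe.
    this.expiry.forEach((expiresAt, id) => {
      if (expiresAt <= nowMs) {
        this.expiry.delete(id);
      }
    });
  }
}
```

A daemon would call `firstSeen` on each routed message and drop it when the result is false. Scanning the whole map on every call is fine for a sketch; a production version would more likely evict on a timer.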

khaliqgant · 2026-01-07T06:20:15Z

Closing in favor of #91

…tart

Three bug fixes reported during MCP testing:

1. **Spawn race condition fix** (Bug #10): Added spawningAgents mutex to prevent concurrent spawn requests for the same agent from both passing the activeWorkers.has() check before either completes.
2. **SIGKILL diagnostics** (Bug #7, #10): Added gatherSigkillDiagnostics() to capture memory usage, process count, and OOM killer messages when exit code 137 or SIGKILL is detected. This helps diagnose resource exhaustion issues.
3. **Orphan cleanup** (Bug #8, #9): Added cleanupOrphanedWorkers() that runs on spawner startup to kill relay-pty processes from a previous daemon run. This ensures a clean slate after daemon restarts.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
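
The spawn race in item 1 can be sketched as follows. Only the names `spawningAgents` and `activeWorkers` come from the commit message; the `Spawner` class, the `WorkerHandle` type, and the injected `launch` function are assumptions, not the project's actual code.

```typescript
// Hypothetical worker handle and launcher types.
type WorkerHandle = { pid: number };
type Launch = (name: string) => Promise<WorkerHandle>;

class Spawner {
  private readonly activeWorkers = new Map<string, WorkerHandle>();
  private readonly spawningAgents = new Set<string>(); // names with a spawn in flight
  private readonly launch: Launch;

  constructor(launch: Launch) {
    this.launch = launch;
  }

  /** Resolves to false if the agent is already running or already being spawned. */
  async spawn(name: string): Promise<boolean> {
    // The checks and the add() run synchronously, so a second request cannot
    // slip in between them the way it could with activeWorkers.has() alone.
    if (this.activeWorkers.has(name) || this.spawningAgents.has(name)) {
      return false;
    }
    this.spawningAgents.add(name);
    try {
      const worker = await this.launch(name);
      this.activeWorkers.set(name, worker);
      return true;
    } finally {
      // Always release the guard, even when the launch fails.
      this.spawningAgents.delete(name);
    }
  }
}
```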

- template-resolver.ts: shell-escape interpolated variables (CRITICAL #1)
- broker_tests.rs: uncomment and wire up 5 real tests (CRITICAL #2)
- worker_tests.rs: uncomment and wire up 5 real tests (CRITICAL #3)
- worker.rs: log bypass-flag injection, add .. path traversal rejection (CRITICAL #4, #7)
- verification.ts: export stripInjectedTaskEcho, add path traversal guard (CRITICAL #5)
- runner.ts: remove duplicate stripInjectedTaskEcho, add ENV_ALLOWLIST filtering (HIGH #17)
- channel-messenger.ts: add secret scrubbing, hoist regex constants (MEDIUM #27, #28)
- process-spawner.ts: add settled guard for race condition (MEDIUM #23)
- step-executor.ts: add sideEffects to callback type, deprecate alias (HIGH #15, #16)
- index.ts: export StepExecutor directly (MEDIUM #29)
- workflows/refactor/*.ts: replace hardcoded paths, remove --no-verify (HIGH #8-11)
- broker.rs: move is_pid_alive to canonical location (HIGH #14)
- cost/tracker.ts: add restrictive file permissions (MEDIUM #30)
- cost/pricing.ts: add last-verified date (MEDIUM #31)
- verification.test.ts: 9 new tests for exported helpers (MEDIUM #32)

Co-Authored-By: My Senior Dev <dev@myseniordev.com>
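
As an illustration of the first finding, shell-escaping interpolated variables usually means quoting each value before it is spliced into a command string. The sketch below uses POSIX single-quote escaping. The `{{name}}` placeholder syntax and both function names are assumptions; the actual template-resolver.ts is not shown on this page.

```typescript
/** Wraps a value in single quotes so a POSIX shell treats it as one literal word. */
function shellEscape(value: string): string {
  // Inside single quotes only the quote itself needs handling:
  // close the quote, emit an escaped quote, reopen the quote.
  return "'" + value.replace(/'/g, "'\\''") + "'";
}

/** Replaces {{name}} placeholders (assumed syntax) with shell-escaped values. */
function resolveTemplate(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (match: string, key: string) => {
    // Leave unknown placeholders untouched rather than guessing a value.
    if (!Object.prototype.hasOwnProperty.call(vars, key)) {
      return match;
    }
    return shellEscape(String(vars[key]));
  });
}
```

With this in place a value such as `x; rm -rf /` reaches the shell as a single quoted argument instead of a second command.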

…#675) * refactor: TDD decomposition of runner.ts + main.rs with extracted modules Extracted 5 modules from runner.ts (6,878 lines): - verification.ts (143 lines) - template-resolver.ts (87 lines) - channel-messenger.ts (151 lines) - step-executor.ts (571 lines) - process-spawner.ts (96 lines) Added characterization tests for all extracted modules. Extracted broker.rs and worker.rs from main.rs. Bug fixes: - Restore stripInjectedTaskEcho in verification.ts - Guard agent.release() against broker 400 race condition - Fix run-summary-table test for new table format - Export normalizeModel for correct pricing resolution - Fix --wave argument parsing in run-refactor.ts - ESM imports in all workflow files * fix: address 10 review finding(s) tracker.ts: resolveModel now uses normalizeModel for alias resolution (pre-existing fix verified) run-refactor.ts: --wave parsing with proper validation (pre-existing fix verified) step-executor.ts: signal-killed processes now correctly treated as failures channel-messenger.ts: replaced ReDoS-vulnerable regex with iterative indexOf stripping runner.ts: eliminated shell injection by using direct git spawn with argument arrays process-spawner.ts: fixed SIGKILL fallback timer leak by storing and clearing reference Co-Authored-By: My Senior Dev <dev@myseniordev.com> * Revert "chore: gitignore .trajectories/ (automated run artifacts) (#676)" (#677) This reverts commit 07a8dc0. * refactor: TDD decomposition of runner.ts + main.rs with extracted modules Extracted 5 modules from runner.ts (6,878 lines): - verification.ts (143 lines) - template-resolver.ts (87 lines) - channel-messenger.ts (151 lines) - step-executor.ts (571 lines) - process-spawner.ts (96 lines) Added characterization tests for all extracted modules. Extracted broker.rs and worker.rs from main.rs. 
Bug fixes: - Restore stripInjectedTaskEcho in verification.ts - Guard agent.release() against broker 400 race condition - Fix run-summary-table test for new table format - Export normalizeModel for correct pricing resolution - Fix --wave argument parsing in run-refactor.ts - ESM imports in all workflow files * trajectories correction again * pre commit is executable * remove tracked workflows * fix: address 36 review findings across Rust and TypeScript modules template-resolver.ts: shell-escape interpolated variables (CRITICAL #1) broker_tests.rs: uncomment and wire up 5 real tests (CRITICAL #2) worker_tests.rs: uncomment and wire up 5 real tests (CRITICAL #3) worker.rs: log bypass-flag injection, add .. path traversal rejection (CRITICAL #4, #7) verification.ts: export stripInjectedTaskEcho, add path traversal guard (CRITICAL #5) runner.ts: remove duplicate stripInjectedTaskEcho, add ENV_ALLOWLIST filtering (HIGH #17) channel-messenger.ts: add secret scrubbing, hoist regex constants (MEDIUM #27, #28) process-spawner.ts: add settled guard for race condition (MEDIUM #23) step-executor.ts: add sideEffects to callback type, deprecate alias (HIGH #15, #16) index.ts: export StepExecutor directly (MEDIUM #29) workflows/refactor/*.ts: replace hardcoded paths, remove --no-verify (HIGH #8-11) broker.rs: move is_pid_alive to canonical location (HIGH #14) cost/tracker.ts: add restrictive file permissions (MEDIUM #30) cost/pricing.ts: add last-verified date (MEDIUM #31) verification.test.ts: 9 new tests for exported helpers (MEDIUM #32) Co-Authored-By: My Senior Dev <dev@myseniordev.com> * style: auto-format Rust code with cargo fmt * minor clean * fix: reinstate deleted workflow files into workflows/ci/ Moved fix-mcp-spawn.yaml, add-swift-sdk.ts, and cli-observability.ts into workflows/ci/ to clearly distinguish them as CI test suite workflows. Updated .gitignore to allow workflows/ci/ and workflows/refactor/. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address remaining Devin review findings and fix failing test

- Fix tracker test: expect mode: 0o700 in mkdirSync assertion
- Use Object.hasOwn() instead of `in` operator to avoid prototype chain false positives
- Use Promise.allSettled to preserve partial output on process timeout
- Apply path containment check for absolute paths in checkFileExists

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address new Devin review findings: StepExecutor name collision and cwd trailing slash

- Rename StepExecutor interface in runner.ts to RunnerStepExecutor to avoid shadowing the StepExecutor class export in the barrel index
- Normalize cwd with path.resolve() in checkFileExists to handle trailing slashes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
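
The path containment check and the `path.resolve()` normalization mentioned in the last two fixes can be sketched like this. The function name `isInside` is hypothetical and this is not the project's `checkFileExists`; it only shows one common way to reject paths that escape a working directory, including absolute ones.

```typescript
import * as path from "node:path";

/** Returns true if candidate resolves to cwd itself or to something beneath it. */
function isInside(cwd: string, candidate: string): boolean {
  const root = path.resolve(cwd);               // also drops any trailing slash
  const target = path.resolve(root, candidate); // an absolute candidate replaces root
  const rel = path.relative(root, target);
  if (rel === "") {
    return true; // candidate is the root itself
  }
  // Anything that has to climb out of root starts with ".." as a full path segment.
  return rel !== ".." && !rel.startsWith(".." + path.sep) && !path.isAbsolute(rel);
}
```

Comparing the relative path instead of using a plain string prefix test avoids treating `/work/repo-other` as being inside `/work/repo`.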

…nessSpec for codex models

- Fix parseHarnessSpec: codex models (gpt-5.4-mini, o3, etc.) are raw OpenAI model names and must not be qualified with a "codex/" prefix; add early-return for cli=codex
- Add codex tier comparison section to eval-master-summary.html: gpt-5.5 recommended (16/16, 0% phantom), gpt-5.4-mini viable budget (15/16, 31% phantom), gpt-5.4 avoid (52% phantom despite 100% majority-vote), spark not viable (6/16)
- Update action item #8 in HTML from "pending" to completed findings with tier table

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

#1109) * docs(evals): add master plan for relay SDK eval suite Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(evals): add op-arg reference + executor result contract Folds in the op argument and check-key conventions surfaced by the first authoring round so the relaunched workers start from a concrete spec. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(evals): mark op vocabulary LOCKED per W1 confirmation Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * Add messaging relay eval suites * Add session listener facade eval suites * Add channel workspace agent directory eval suites * Add delivery actions eval suites * Relax W3 eval content expectations * Align W3 eval duplicate error codes * Relax delivery action eval expectations * Align W5 eval cases with runner output * Align workspace eval assertions with runner output * Mark capability eval hooks pending * Relax unknown agent removal eval assertion * feat(evals): add relay eval harness * chore: apply pr-reviewer fixes for #1092 * Add local workflow run commands * Avoid log read file race * Use Relayflows for local workflow runs * Add agent messaging eval suite and reports Introduce a new integration eval suite that exercises agent-to-agent messaging via the broker and scores protocol adherence (message-sent rate, phantom messages, ACK/DONE protocol, wrong-channel replies). Adds a full eval runner, scenarios, deterministic scoring, reporters (JSON + self-contained HTML viewer), a matrix roll-up, unit tests for scoring, and CLI helpers under tests/integration/broker/evals. Adds npm scripts (eval:build, eval:unit, eval:selftest, eval:toolcheck, eval:html, eval, eval:claude, eval:matrix) and gitignore entry for evals-reports. 
Also adds a Fleet Delivery design doc (specs/fleet-delivery.md), updates CHANGELOG.md, and adjusts integration test config/files (tsconfig, vitest, and broker harness utilities) to align the broker-harness with the current SDK/harness-driver APIs so the evals build/run cleanly. * feat(evals): add spawn/release reliability eval suite with onboarding variants Adds three new scenarios (s01-spawn-worker, s02-release-worker, s03-lifecycle) that test whether agents reliably call add_agent and remove_agent across four onboarding variants (bare, one-liner, brief, skill) — lightest to heaviest — to find the minimum text that achieves 10/10 reliability. Ground truth uses broker events (agent_spawned/agent_released) not text parsing. Adds opencode free-model support via --harness=opencode:mimo-v2-flash-free. Adds npm scripts: eval:lifecycle, eval:lifecycle:free, eval:lifecycle:matrix. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * chore: apply pr-reviewer fixes for #1109 * feat(evals): multi-model lifecycle evals, HTML report redesign, 4 new scenarios Scoring fix: remove parent===leadAgent filter from scoreSpawn and waitForEvent — the broker HTTP API never emits parent in agent_spawned events, causing all spawn scores to return FAIL. Any agent_spawned event is now trusted as ground truth (eval scenarios have exactly one lead). Model threading: ScenarioContext.model + BrokerHarness.spawnAgent model option propagate claude --model flags through the eval harness. Runner parses claude:haiku/sonnet/opus into full model IDs; opencode model set via OPENCODE_MODEL env var. New eval scripts: eval:lifecycle:claude-models (haiku/sonnet/opus) and eval:lifecycle:full (all models + opencode + codex). 4 new realistic messaging scenarios: t01-thread-reply, r05-check-inbox, r06-group-dm, r07-list-agents — bringing the total to 12 lifecycle + 12 messaging = 24 scenarios. 
HTML report redesigned: CI dashboard dark theme, per-model lifecycle variant breakdown table, spawn/release rate bars, transcripts collapsed by default, scenario groups separated. SKILL.md: fix all mcp__relaycast__ → mcp__agent-relay__ and hierarchical tool names to registered flat names (eval:toolcheck now passes). Onboarding variants: one-liner explicitly names mcp__agent-relay__ prefix; skill variant cleaned up (removed confusing dual-form syntax). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * feat(evals): s04 native-subagent detection scenario Adds s04-no-native-subagents (4 onboarding variants) — a lifecycle scenario that specifically detects whether Claude falls back to its built-in Task tool instead of mcp__agent-relay__add_agent when asked to spawn parallel workers. Ground truth: agent_spawned broker event = relay tool used (PASS). Detection: worker_stream contains "Task(" with no agent_spawned = native subagent confirmed (FAIL, notes distinguish native vs phantom vs no-spawn). New scoring/native-subagent.ts exposes detectNativeSubagent() which scans cleanStreamOutput for the Task( invocation pattern Claude Code emits. HTML report: FAIL cards for s04 show a red "NATIVE TASK" pill in the header and a "tool: NATIVE TASK (not add_agent)" stat entry when detected. ScenarioResult gains nativeSubagentDetected?: boolean used by both the report renderer and the scenario notes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * feat(broker): model-aware relay skill injection for small-tier models Eval data showed haiku achieves 0/5 spawn reliability without onboarding guidance and 5/5 with the full skill variant. Sonnet/Opus pass bare (0-shot). 
When an agent is spawned via the HTTP API (add_agent MCP path) with a small-tier model (haiku, gpt-*-mini, gemini-*-flash), the broker now automatically prepends a concise relay skill block to the initial task: ## Agent Relay — Worker Management ### Spawn a worker mcp__agent-relay__add_agent(name, cli, task) ### Release a worker mcp__agent-relay__remove_agent(name) This happens in api.rs after workers.spawn() returns the effective spec (normalized model), so the prefix uses the resolved model ID. The prefix is appended before the task is stored in initial_tasks and delivered as the agent's first message. Large models (sonnet, opus, pro, gpt-4o) receive no prefix — they reliably call the relay tools from context alone and the extra text is unnecessary. 4 unit tests cover haiku/mini/flash (prefix), sonnet/opus/pro (no prefix), and None model (no prefix). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * feat(evals): comprehensive opencode model lifecycle eval scripts Add 5 new eval scripts covering the full opencode model registry: eval:lifecycle:opencode-native — opencode-specific models (mimo-v2-flash-free, minimax-m2.5-free, big-pickle, gpt-5-nano) eval:lifecycle:opencode-gpt4 — GPT-4 family via opencode (gpt-4o, gpt-4o-mini, gpt-4.1, gpt-4.1-mini, gpt-4.1-nano) eval:lifecycle:opencode-gpt5 — GPT-5 family (gpt-5, gpt-5-mini, gpt-5-nano, gpt-5.2, gpt-5.4) eval:lifecycle:opencode-reasoning — o-series reasoning (o1-mini, o3-mini, o4-mini, o3, codex-mini-latest) eval:lifecycle:opencode-all — all 18 opencode models in one sweep Each runs 5 repeats × 16 lifecycle scenarios (s01–s04, 4 onboarding variants). Models are passed via OPENCODE_MODEL env var; report label uses the short model name (e.g. opencode:gpt-4o) via the existing .split('/').pop() path. 
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * feat(evals): s05 phrasing variants, auto-routing spec, skill text fix, opus timeout fix - Add s05-phrasing-variants.ts: 6 scenarios testing relay vocabulary in task prompts (neutral-worker/agent, relay-worker/agent, arw-worker/agent) with bare onboarding to isolate the pure phrasing effect on spawn reliability - Wire s05 into runner as --group=phrasing; add eval:phrasing and eval:phrasing:claude-models scripts; add eval:phrasing:all-harnesses - Add dedicated lifecycle eval scripts for all installed harnesses: codex, gemini, grok, cursor, droid; add eval:lifecycle:all-harnesses - Fix onboarding.ts skill text: replace "do it yourself for quick lookups" heuristic with explicit rule that honours direct delegation instructions; this was causing sonnet s01:skill=0% and opus s03:skill=0% - Make s03 response window model-aware: opus gets 120s per phase (up from 60s) via responseMs(model) helper; overall scenario timeout bumped to 300s - Add specs/auto-routing.md: task classifier + team composer design grounded in lifecycle eval results (sonnet+one-liner=100% lead, haiku=worker-only, opus timeout-limited on s03) - Record lifecycle eval b3oqx02zv findings in trail trajectory Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * feat(auto): add auto-routing Phase 1–3 — classifier, composer, Director prompt - packages/cli/src/auto/classifier.ts: heuristic task classifier (no extra LLM call); returns complexity/parallelizable/domains/estimatedWorkers in <1ms - packages/cli/src/auto/composer.ts: routing table → TeamSpec; lead is always sonnet (one-liner) or opus (bare); haiku is worker-only per eval data - packages/cli/src/auto/director-prompt.ts: builds pre-composed Director meta-prompt so lead only coordinates — uses "relay worker" noun from s05 phrasing eval (early data shows relay-worker = 60% vs neutral-worker = 0% for haiku with bare onboarding) - packages/cli/src/auto/index.ts: barrel export - 13/13 unit 
tests passing for classifier and composer Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * feat(evals): add s06 auto-routing Director scenario and auto-routing eval group - s06-auto-routing.ts: validates Director multi-worker spawn using pre-composed meta-prompt; PASS requires ≥2 relay agent_spawned + no native Task tool usage - scenarios/index.ts, runner.ts: register s06 under --group=auto-routing - package.json: add eval:auto-routing and eval:auto-routing:claude-models scripts - Runner header updated to document auto-routing group Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs(specs): update auto-routing spec with confirmed timeout fix and s05 early phrasing data - Opus s03 bare=67% (up from 40%) after responseMs() 120s/phase fix, confirmed - s05 phrasing: haiku relay-worker=60% vs neutral-worker=0%, worker noun outperforms agent noun across all vocabulary tiers for haiku with bare onboarding - arw-worker (agent-relay worker) matching relay-worker performance at 100% so far - Updated open questions table with confirmed answers Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs(specs): update auto-routing spec with skill text analysis and phrasing finding The task vocabulary is the bigger spawn-reliability lever, not onboarding text: - Skill text heuristic fix: sonnet s01:skill 0%→33% (partial improvement) - Root cause: task uses neutral 'worker agent' vocabulary; relay-worker phrasing yields 60% vs neutral-agent 20% for haiku with bare onboarding (s05 confirms) - Production fix already in place: Director meta-prompt uses 'relay worker' Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(evals): strengthen skill text with explicit relay-tool disambiguation Add explicit note in the skill onboarding variant to prevent models from confusing 'assign to a worker agent' in the task with their built-in Task capability. 
Uses 'relay worker' vocabulary in section headings for consistency with s05 phrasing eval findings (relay-worker=60% vs neutral-agent=20%). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs(specs): add routing table rationale — conditional guidance hurts capable models opus s03 brief=0% vs bare=67%, one-liner=67% confirms that conditional spawn guidance ('Spawn when... dedicated focus') gives capable models permission to skip delegation. Routing table correctly uses bare/one-liner for leads only. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(evals): revise brief onboarding to remove conditional spawn guidance 'Spawn when a task needs dedicated focus' gave capable models (opus) permission to skip delegation. Now uses directive language: 'When the task says to delegate or assign work, call add_agent.' Also uses relay-worker vocabulary for consistency with s05 phrasing findings. Validated: opus s03 brief was 0% with old conditional text. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs(specs): update opus s03 results table with confirmed timeout fix findings - bare=67%, one-liner=67%, brief=0% (fixed), skill=67% with 120s/phase timeout fix - All variants at 67% except brief which used a conditional spawn clause (now fixed) - brief onboarding reverted to directive language in codebase Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(broker): fix grok mcp add — NAME-first ordering and embed flag-args in command grok v0.2.x has two parser quirks vs standard CLI conventions: 1. The positional <NAME> argument must come before any options (--env, --command, --args) 2. Flag-shaped --args values like `-y` are rejected as unknown options Fix both in `grok_mcp_add_args`: move AGENT_RELAY_MCP_SERVER to position [2] (immediately after "mcp add"), and embed flag-shaped args into the --command string ("npx -y") rather than passing them via --args. 
Also: wire Phase 4 auto-routing into local agent spawn command (--model auto triggers classifyTask → composeTeam → buildDirectorPrompt and spawns a Director with the right model tier), and update eval scripts to add all-harnesses phrasing script with current opencode free model names. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs(specs): add multi-harness eval findings to auto-routing spec - Add non-Claude lifecycle table: codex/gemini 100%, droid 80%+, grok/cursor 0% - Add phrasing matrix (early data): non-Claude vocabulary-agnostic, Claude vocab-dependent - Key insight: relay-anchored nouns matter only for Claude; codex/droid/gemini/opencode achieve high pass rates with neutral vocabulary - Update open-questions table with cross-harness status Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs(specs): update multi-harness phrasing table with 5-run results Codex: 100% all 6 variants (one 80% outlier on relay-agent) OpenCode: 100% on 5/6 variants, 80% neutral-agent Droid: 80-100% across all tested variants Gemini: 60-100% across tested variants Grok/Cursor: 0% all — behavioral non-starters Claude haiku: 0-60% vocabulary-dependent Claude sonnet: 0-40% even with relay-anchored vocabulary Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs(specs): clarify grok failure mode post MCP fix + confirm opencode phrasing complete - Grok: MCP config errors gone (exit 2 fixed); failure is now behavioral (model ignores relay tools despite MCP being available) - OpenCode phrasing complete (29/30 = 97%): perfect across 5 variants, one neutral-agent miss; relay-native with no vocabulary dependency - Mark opencode arw-agent column as complete in phrasing table Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs(specs): add s01/s02/s03 lifecycle breakdown tables for non-Claude harnesses Key findings: - Codex: 100% s01+s02 all onboardings; s03 bare=80%, one-liner=100% - OpenCode: s03 bare=100% (best full-lifecycle result); 
brief onboarding weak on s01/s02 - Gemini: perfect spawn (s01=100%) but release degrades without onboarding - Droid: good spawn but nearly 0% release (s02) without skill onboarding - Release issue for droid/gemini compensated by explicit Director prompt Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs(specs): update phrasing table with droid/gemini/grok/cursor data - Droid: 80-100% across all variants, relay-worker=100% outperforms neutral - Gemini: relay-agent=25% outlier (relay+agent confused model); relay-worker=80% - Grok: 0% all variants confirmed - Cursor: 0% all variants confirmed - Add universal recommendation: use 'relay worker' noun — best across both Claude and non-Claude models; avoid 'relay agent' (hurts Gemini badly) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs(specs): update s03 lifecycle table with codex/opencode confirmed results codex s03: bare=80%, one-liner=100% opencode s03: bare=100% (best), one-liner=80%, brief=100% (so far) Still running: codex s03 brief/skill, opencode s03 skill, gemini/droid s03 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(evals): emphasize parallel spawn in s06 Director prompt Models were calling add_agent for first worker then processing ACK DMs before spawning the second worker, causing 0% multi-spawn pass rate. Adding explicit CRITICAL instruction to execute all spawn calls back-to-back without waiting between them, and to ignore ACK DMs until all workers are spawned. 
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs(specs): update auto-routing spec with complete s03/s05 non-Claude results - s05 phrasing: all non-Claude harnesses complete (5x5 runs each) - codex: 80-100% on all vocabulary variants (relay-native) - opencode: 80-100% across all variants - droid: 80-100%; relay-worker=100%, arw-agent=100% - gemini: neutral-agent/arw-agent=100%; relay-agent=40% (relay+agent suffix confuses gemini) - s03 full lifecycle: codex brief/skill=100% (confirmed); droid bare=100% (surprising given s02 bare=20%; directive task phrasing drives release); gemini one-liner=100% - Eval coverage table updated: s05 complete, s03 non-Claude answers resolved - s06 status: 0% with original sequential prompt; re-running with parallel-spawn fix Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs(specs): update s03 with droid/gemini confirmed results droid: bare/one-liner/brief all 100% (surprising given s02 bare=20%; directive task phrasing "report DONE when complete" drives release even without explicit onboarding) gemini: one-liner/brief/skill all 100%; bare=60% is the only gap. Implication: Director prompt's explicit remove_agent calls + directive task phrasing substitute for skill onboarding on all viable harnesses. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(evals): inject fake worker ACK in s06 to unblock Director second spawn Root cause: models spawn first worker then wait indefinitely for an ACK that never comes (worker task prompt doesn't include relay ACK instruction). Director never spawns the second worker within phaseMs. Fix: after the first agent_spawned event, inject an ACK message from that worker to the Director. This unblocks the Director's wait state and triggers the second spawn call. 
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(evals): extend s06 spawn window to 180s and simplify ACK text Extended spawn detection from phaseMs (60s) to 180s to accommodate sequential spawning: Director spawns first worker, waits for ACK, then spawns second worker within the extended window. Simplified injected ACK text — "report DONE when complete" in the previous text was confusing models into thinking the worker was done. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(evals): change s06 injection to Orchestrator nudge after first spawn Previous approach (Worker ACK injection) didn't trigger the second spawn because Director processed the ACK passively. Changed to an Orchestrator message that explicitly instructs the Director to spawn Worker-Frontend. This reflects the finding that models require external stimulus to chain multiple add_agent calls — they don't spontaneously spawn the second worker after the first add_agent returns. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs(specs): document s06 multi-spawn finding and add to open questions Key finding: models make one add_agent call per response then stop/wait. No prompt instruction (CRITICAL warnings, numbered mandatory actions) reliably chains two consecutive add_agent calls in PTY mode. External Orchestrator nudge after first spawn is required. This is a Phase 5 concern but should inform Phase 3 Director prompt design — consider pre-wiring workers before Director starts, or using sequential hand-off patterns instead of parallel spawn. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(evals): s06 v6 — Director executes from meta-prompt alone, no trigger message All previous approaches (CRITICAL instruction, ACK injection, Orchestrator nudge) failed because an incoming message triggers only a partial response (single add_agent call). 
Switching to production-matching approach: Director spawns workers autonomously from its meta-prompt without any trigger message. Extends wait window to 180s from STARTUP_MS offset to cover Director boot + sequential spawn pattern. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs(specs): update s04 table with codex/gemini/droid results codex: 100% all variants — never routes to native subagents droid: 0% on bare/one-liner (uses native Task without disambiguation); skill text that fixes s04 for Claude breaks droid s03 (0%) gemini: 80% bare, 60% one-liner, 80%+ brief/skill — mostly relay-native Key insight: droid needs skill onboarding to avoid native subagents (s04), but the SAME skill text that helps s04 breaks s03 — needs investigation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(evals): s06 v7 — count startup spawns before clearEvents The Director processes its task immediately after boot. clearEvents() after STARTUP_MS was discarding agent_spawned events fired during startup. Now count spawns that happened during startup before clearing, then continue watching for more spawns with the 180s window. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs(evals): complete s05 phrasing data for all 7 harnesses; fix s06 scoring; document PTY multi-spawn limitation - s05 phrasing eval complete across all harnesses: opus achieves 100% on relay-worker/relay-agent/arw-worker/arw-agent (vs haiku 60%/20%/60%/40%), confirming relay-anchored vocabulary is required for Claude with opus showing the strongest native tool knowledge - s06 scoring fix: remove clearEvents() before scoring so startup spawns are counted by scoreSpawn; use plain sleep() instead of waitForEvent (which resolves immediately on buffered events) - s06 architectural conclusion: PTY-mode agents make exactly one tool call per turn — 0% multi-worker spawn across all harnesses and 8+ prompt approaches. 
Production fix: pre-spawn workers from CLI layer in Phase 4, not via Director - Update specs/auto-routing.md with opus phrasing table, s06 final answer, and revised open questions - Update CHANGELOG with complete phrasing findings and s06 architectural decision Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(cursor): resolve cursor CLI to cursor-agent binary, not grok's agent binary The broker's command_parse was mapping cli="cursor" to "agent", but the `agent` binary on PATH is /Users/khaliqgant/.grok/bin/agent (the Grok CLI). This meant all cursor eval runs were actually running Grok, explaining the 0% scores. Fix: map cursor → cursor-agent explicitly in command_parse. Update harness command to match. Update unit tests accordingly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs(evals): cursor-agent confirmed 0% — not viable relay worker Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * feat(evals): add codex model tier comparison results and fix parseHarnessSpec for codex models - Fix parseHarnessSpec: codex models (gpt-5.4-mini, o3, etc.) are raw OpenAI model names and must not be qualified with a "codex/" prefix — add early-return for cli=codex - Add codex tier comparison section to eval-master-summary.html: gpt-5.5 recommended (16/16, 0% phantom), gpt-5.4-mini viable budget (15/16, 31% phantom), gpt-5.4 avoid (52% phantom despite 100% majority-vote), spark not viable (6/16) - Update action item #8 in HTML from "pending" to completed findings with tier table Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * feat(evals): add opencode batch eval script and codex tier constants to composer - Add run-opencode-models.sh: two-phase deterministic batch eval over 39 opencode model tiers. Phase 1 screens with s03 repeat=3; Phase 2 runs s01-s04 repeat=5 for models that score ≥67% in Phase 1. No LLM coordinator — pure shell loop. 
- Add CODEX_MODEL_TIERS, WorkerCli type, and per-harness HARNESS_ONBOARDING defaults to composer.ts based on lifecycle eval findings (2026-06-12) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(evals): harden opencode batch script — tolerate per-model failures, add missing claude variants - run_eval: add || true so one model failure doesn't abort the entire batch - Expand model list from 41 to 45: add claude-sonnet-4, claude-opus-4-1/4-5/4-6/4-7 to match actual opencode models output Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(evals): scope opencode batch to alternative/Chinese models only Drop Claude and OpenAI variants — those are already covered by native CLI evals. Keep: DeepSeek, Kimi, Qwen, Minimax, GLM, MiMo, Grok-via-opencode, Gemini-via-opencode, Nemotron, North-mini, big-pickle (19 models total). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * feat(auto): add role-fit classification layer to composer and auto-routing spec - Add AgentRole type (lead, coordinator, worker, planner, reviewer, critic, verifier, judge, mapper, reducer, supervisor, debater) matching the choosing-swarm-patterns skill's role taxonomy - Add HarnessRoleMap + HARNESS_ROLE_MAP: eval-backed table mapping each harness to the roles it can fill (confirmed/provisional/not-viable/untested) - Add harnessesForRole() helper for pattern-aware slot selection - Expand WorkerSpec to carry cli, codexModel, opencodeModel - Add §5 Role-Fit Classification to specs/auto-routing.md: role definitions, eval signal → role mapping, provisional role-fit table, pattern → role → harness assignment table, and s07-s11 scenario roadmap for full coverage opencode Chinese/alternative model rows marked 'untested'; will update from Phase 1+2 batch eval results. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * feat(evals): add Phase 1 opencode batch results to role-fit map 19 alternative/Chinese models evaluated (s01-s04, repeat=3). 17/19 advance. 
Top tier (16/16, 0-2 phantoms): deepseek-v4-flash, deepseek-v4-flash-free, qwen3.6-plus, qwen3.5-plus, minimax-m2.5, minimax-m2.7, glm-5.1, big-pickle Confirmed (16/16, 3-9 phantoms): glm-5, gemini-3.1-pro, grok-build-0.1, gemini-3-flash Provisional (12-15/16): kimi-k2.5, kimi-k2.6, mimo-v2.5-free, gemini-3.5-flash, north-mini-code-free Eliminated: deepseek-v4-pro (11/16), nemotron-3-ultra-free (10/16) Key findings: - opencode normalizes flaky native CLIs: gemini bare 60% native → 16/16 via opencode; grok 0/48 native → 16/16 via opencode (model capable, native CLI MCP was broken) - All top-tier Chinese models are relay-native across all 4 onboarding variants - HARNESS_ROLE_MAP updated with per-model entries for all 19 tested models Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * feat(evals): add role-fit rankings and communicator sections to master summary Adds §7 "Role Rankings — Top 5 per Role" to eval-master-summary.html: - Lead/Coordinator: claude:sonnet > opus > codex:gpt-5.5 > opencode:deepseek-v4-flash > opencode:qwen3.6-plus - Worker: codex:gpt-5.5 > opencode:deepseek-v4-flash > minimax-m2.5 > qwen3.5/3.6-plus - Reviewer/Critic: claude:opus > sonnet > codex:gpt-5.5 > opencode:gemini-3.1-pro - Mapper/Reducer: codex:gpt-5.5 > opencode:deepseek-v4-flash > minimax-m2.5 > qwen3.6 > glm-5.1 - Judge: claude:opus > sonnet > codex:gpt-5.5 - Communicator: claude:opus > sonnet > codex:gpt-5.5 > opencode:gemini-3.1-pro > opencode:deepseek-v4-flash - Best lead callout: claude:sonnet + one-liner is the clear #1 across all criteria Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * feat(evals): add @agent-relay/evals package with publish pipeline Creates packages/evals/ as the shared eval harness package: - package.json with subpath exports for types, runner, harness, scoring, and scenarios - tsconfig.json matching the monorepo TypeScript config - src/types.ts (canonical copy of EvalScenario/ScenarioResult/etc. 
types) - src/harness.ts (BrokerHarness interface for downstream type-checking) - src/index.ts (barrel re-export) Adds build:evals to root build:core chain (after harness-driver, before cli). Updates publish.yml: converts publish-harnesses to a matrix job covering both harnesses and evals — same dependency pattern (needs publish-packages so harness-driver is on the registry before evals can be installed). Also adds specs/agent-relay-evals-package.md scoping the full migration. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * style: auto-format Rust code with cargo fmt * feat(evals): add s07-lead-delegation scenario group Three new eval scenarios that test lead role discipline — the gap that existing s03/s04 don't cover (they only check spawn mechanics, not whether the lead actually delegates vs. self-implements): l01 — Unconditional delegation: coding task with no "don't do it yourself" hint. Good lead spawns; bad lead writes code itself. l02 — Temptation resistance: task explicitly says "you'd be faster doing it yourself". Lead must still delegate. l03 — Post-delegation synthesis: after workers report DONE, lead must send a synthesis message referencing their results. New scoring fields on ScenarioResult: - selfImplemented: lead's PTY stream contained code blocks / impl tool calls - synthesisOk: lead sent a synthesis message after receiving DONE All three run across bare/one-liner/brief/skill onboarding variants (12 total). Runner: --group=lead-delegation, npm scripts eval:lead-delegation, eval:lead-delegation:claude-models, eval:lead-delegation:all-harnesses. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(mcp): pass model via metadata in add_agent spawn call SpawnAgentRequest has no top-level model field; model must be passed through metadata so the broker can forward --model to the launched CLI. Also auto-format prettier fixes in auto/ and local-agent.ts. 
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(evals): bump @agent-relay/evals to 8.6.0 and regenerate lock file

Package was pinned at 8.3.1, causing npm ci to fail: the lock file was missing @agent-relay/evals and several stale version entries. Updated to match monorepo version 8.6.0 and regenerated package-lock.json.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* style: auto-format all files with prettier; add broker gitignore for eval artifacts

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(evals): use rk_live_ substring check instead of glob in workspaces eval

contentIncludes does literal substring matching; rk*live* was treated as the literal string, not a glob. The mock key rk_live_ws_eval_create matches rk_live_ as a prefix check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* style: auto-format with Prettier

* fix(ci): resolve three CI failures on PR #1109

1. composer.test.ts: check for role === 'reducer' not 'Synthesiser'; the AgentRole type uses 'reducer' for aggregator workers.
2. packages/cli/package.json: bump all @agent-relay/* workspace deps from 8.2.0 → 8.7.2 so npm ci resolves them to the local workspace symlinks instead of installing stale registry copies. This was shadowing the local @agent-relay/cloud build and hiding cloudWorkerStateDir.
3. package-lock.json: regenerated to remove the nested packages/cli/node_modules/@agent-relay/{cloud,config,sdk,...}@8.2.0 entries that were installed from npm.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(evals): re-apply rk_live_ substring fix after prettier bot revert

The prettier bot ran on the pre-fix commit and its auto-format push overwrote the rk*live* → rk_live_ fix. Reapplied: contentIncludes does literal substring matching, not glob, so rk_live_ is correct.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* style: auto-format with Prettier

* fix(evals): use exact mock key in workspace contentIncludes check

Replace the glob-style rk*live* (reverted by prettier bot) with the full exact key rk_live_ws_eval_create that the mock actually returns. This is a precise substring check, passes prettier with no diff, and won't be overwritten by the prettier auto-format workflow.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(lock): remove stale broker@8.2.0 entries from packages/cli/node_modules

When CLI's @agent-relay/* deps were bumped from 8.2.0 → 8.7.2, the old @agent-relay/broker-*@8.2.0 optional entries left by the previous npm install were not cleaned by --package-lock-only. These shadowed the root workspace broker binaries (8.7.2), causing 'unexpected argument --instance-name' from the old CLI init interface.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Proactive Runtime Bot <agent@agent-relay.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: agent-relay-code[bot] <agent-relay-code[bot]@users.noreply.github.com>
Co-authored-by: Will Washburn <will.washburn@gmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
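
The rk*live* versus rk_live_ fix in the commit messages above comes down to literal substring matching versus glob matching. A minimal sketch of the difference follows, assuming a `contentIncludes` helper that simply wraps `String.prototype.includes`; the real helper's signature is not shown in this PR, and `globIncludes` and the `mockOutput` string are hypothetical, added only for contrast.

```typescript
// Hypothetical stand-in for the eval helper named in the commit messages.
// Assumption: it performs a plain, literal substring check.
function contentIncludes(content: string, needle: string): boolean {
  return content.includes(needle);
}

// Hypothetical glob-style matcher for contrast: `*` matches any run of characters.
// Every other character is escaped so it is matched literally.
function globIncludes(content: string, pattern: string): boolean {
  const source = pattern
    .split("*")
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp(source).test(content);
}

// Assumed output text; only the key name comes from the commit message.
const mockOutput = "created key rk_live_ws_eval_create";

// Literal matching: the asterisks are ordinary characters, so this does not match.
const literalGlobStyle = contentIncludes(mockOutput, "rk*live*");

// Literal matching with the prefix or the full mock key does match.
const literalPrefix = contentIncludes(mockOutput, "rk_live_");
const literalExact = contentIncludes(mockOutput, "rk_live_ws_eval_create");

// Only a real glob matcher would accept the starred pattern.
const globStyle = globIncludes(mockOutput, "rk*live*");

console.log({ literalGlobStyle, literalPrefix, literalExact, globStyle });
```

Under these assumptions the starred pattern fails the literal check while both `rk_live_` and the full key pass, which is consistent with the final commit settling on the exact key.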

claude and others added 23 commits December 21, 2025 07:47

feat: Add detached mode for long-running agent sessions

d540dc1

Add -d/--detach flag to start agents in background, allowing SSH users to disconnect without losing agent sessions. Includes attach/kill commands for session management.

test: Add tests for attach, kill commands and detach flag

5320172

fix(test): Check for .project marker instead of dataDir in listProjects test

5b7d60e

The test was checking if dataDir exists, but listProjects() requires the .project marker file to be present.

chore: Hide internal --_daemon flag from CLI help

ff51bdb

Merge branch 'claude/analyze-mcp-agent-mail-IXbNF' of github.com:khaliqgant/agent-relay into claude/continue-pr-8-7tWzb

75b43e9

Merge branch 'main' of github.com:khaliqgant/agent-relay into claude/continue-pr-8-7tWzb

ed171e3

Merge pull request #15 from khaliqgant/claude/continue-pr-8-7tWzb

bbcb4e9

feat: Add detached mode for long-running agent sessions

Merge main into feature branch

76cce4e

khaliqgant pushed a commit that referenced this pull request Dec 30, 2025

Add hooks API proposal with trajectory integration

6b01908

Port the hooks API design document from PR #8 with additional trajectory integration examples showing how hooks can work with the PDERO paradigm and trail CLI.

khaliqgant mentioned this pull request Jan 7, 2026

Add comprehensive multi-server architecture document #91

Merged

khaliqgant closed this Jan 7, 2026

willwashburn deleted the claude/analyze-mcp-agent-mail-IXbNF branch May 15, 2026 13:09

khaliqgant wants to merge 23 commits into main from claude/analyze-mcp-agent-mail-IXbNF

2 participants
